k-nn model
Fast and Interpretable Machine Learning Modelling of Atmospheric Molecular Clusters
Seppäläinen, Lauri, Kubečka, Jakub, Elm, Jonas, Puolamäki, Kai
Understanding how atmospheric molecular clusters form and grow is key to resolving one of the biggest uncertainties in climate modelling: the formation of new aerosol particles. While quantum chemistry offers accurate insights into these early-stage clusters, its steep computational costs limit large-scale exploration. In this work, we present a fast, interpretable, and surprisingly powerful alternative: a $k$-nearest neighbour ($k$-NN) regression model. By leveraging chemically informed distance metrics, including a kernel-induced metric and one learned via metric learning for kernel regression (MLKR), we show that simple $k$-NN models can rival more complex kernel ridge regression (KRR) models in accuracy, while reducing computational time by orders of magnitude. We perform this comparison with the well-established Faber-Christensen-Huang-Lilienfeld (FCHL19) molecular descriptor, but other descriptors (e.g., FCHL18, MBDF, and CM) show similar performance. Applied to both simple organic molecules in the QM9 benchmark set and large datasets of atmospheric molecular clusters (sulphuric acid-water and sulphuric acid-multibase systems), our $k$-NN models achieve near-chemical accuracy, scale seamlessly to datasets with over 250,000 entries, and even appear to extrapolate to larger unseen clusters with minimal error (often nearing 1 kcal/mol). With built-in interpretability and straightforward uncertainty estimation, this work positions $k$-NN as a potent tool for accelerating discovery in atmospheric chemistry and beyond.
- Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.14)
- Europe > Finland > Uusimaa > Helsinki (0.05)
- Asia > Japan > Honshū > Kantō > Kanagawa Prefecture (0.04)
- Research Report (0.64)
- Workflow (0.46)
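The abstract above hinges on $k$-NN regression with a pluggable, chemically informed distance metric. A minimal pure-Python sketch of that idea follows; a plain Euclidean distance stands in for the paper's kernel-induced or MLKR-learned metrics, and the toy features and targets are invented:

```python
from math import dist

def knn_regress(query, X, y, k=3, metric=None):
    """Predict by averaging the targets of the k nearest training points.
    `metric` is any callable returning a distance between two points; the
    paper's kernel-induced or MLKR-learned metrics would slot in here
    (math.dist, i.e. Euclidean distance, is only a placeholder)."""
    metric = metric or dist
    nearest = sorted(range(len(X)), key=lambda i: metric(query, X[i]))[:k]
    return sum(y[i] for i in nearest) / k

# Toy data: the target is the sum of the two "features"
X = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0), (2.0, 2.0)]
y = [0.0, 1.0, 1.0, 2.0, 4.0]
print(knn_regress((0.9, 0.9), X, y, k=3))
```

Because the metric is just a callable, swapping in a learned metric changes one argument rather than the model, which is part of what makes the approach cheap to iterate on.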
A Neighbourhood Framework for Resource-Lean Content Flagging
Sarwar, Sheikh Muhammad, Zlatkova, Dimitrina, Hardalov, Momchil, Dinkov, Yoan, Augenstein, Isabelle, Nakov, Preslav
We propose a novel interpretable framework for cross-lingual content flagging, which significantly outperforms prior work both in terms of predictive performance and average inference time. The framework is based on a nearest-neighbour architecture and is interpretable by design. Moreover, it can easily adapt to new instances without the need to retrain it from scratch. Unlike prior work, (i) we encode not only the texts, but also the labels in the neighbourhood space (which yields better accuracy), and (ii) we use a bi-encoder instead of a cross-encoder (which saves computation time). Our evaluation results on ten different datasets for abusive language detection in eight languages show sizable improvements over the state of the art, as well as a speed-up at inference time.
- Europe (1.00)
- North America > United States > Minnesota (0.28)
- Government (1.00)
- Law (0.68)
- Information Technology > Communications > Social Media (1.00)
- Information Technology > Artificial Intelligence > Representation & Reasoning (0.99)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.93)
- Information Technology > Artificial Intelligence > Natural Language > Text Processing (0.68)
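The core of the framework above is retrieval: encode a query, find its nearest labelled neighbours, and flag it from their labels, with no retraining needed when new labelled examples arrive. A stdlib-only sketch of that loop, with a bag-of-words cosine similarity standing in for the paper's bi-encoder and invented toy examples:

```python
from collections import Counter
from math import sqrt

def encode(text):
    """Stand-in for the paper's bi-encoder: a bag-of-words vector.
    Each text is encoded independently, so new labelled instances can be
    added to the neighbourhood without retraining anything."""
    return Counter(text.lower().split())

def cosine(a, b):
    num = sum(a[t] * b[t] for t in a)
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return num / norm

def flag(query, examples, k=3):
    """Label a query by majority vote over its k most similar labelled neighbours."""
    q = encode(query)
    ranked = sorted(examples, key=lambda ex: cosine(q, encode(ex[0])), reverse=True)[:k]
    labels = [label for _, label in ranked]
    return max(set(labels), key=labels.count)

examples = [
    ("you are an idiot", "abusive"),
    ("idiot idiot idiot", "abusive"),
    ("have a nice day", "ok"),
    ("what a lovely day", "ok"),
]
print(flag("you idiot", examples, k=3))
```

The interpretability claim falls out of the architecture: the retrieved neighbours themselves are the explanation for the flag.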
DNN or $k$-NN: That is the Generalize vs. Memorize Question
Cohen, Gilad, Sapiro, Guillermo, Giryes, Raja
This paper studies the relationship between the classification performed by deep neural networks and the $k$-NN decision at the embedding space of these networks. This simple yet important connection provides a better understanding of the relationship between the ability of neural networks to generalize and their tendency to memorize the training data, which are traditionally considered to be at odds with each other but are shown here to be compatible and complementary. Our results support the conjecture that deep neural networks approach Bayes optimal error rates.
- Asia > Middle East > Israel > Tel Aviv District > Tel Aviv (0.05)
- North America > United States > Texas (0.04)
- North America > United States > North Carolina (0.04)
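The comparison the paper above describes can be sketched in a few lines: take the embeddings a network produces, replace its classification head with a leave-one-out $k$-NN vote, and measure how often the two decisions agree. Here the "embeddings" and network predictions are synthetic stand-ins, not outputs of an actual DNN:

```python
from collections import Counter
from math import dist

def knn_label(q_idx, embs, labels, k=3):
    """k-NN decision at the embedding space: classify point q_idx from the
    labels of its k nearest *other* embeddings (leave-one-out)."""
    order = sorted((i for i in range(len(embs)) if i != q_idx),
                   key=lambda i: dist(embs[q_idx], embs[i]))[:k]
    return Counter(labels[i] for i in order).most_common(1)[0][0]

# Synthetic stand-ins: `embs` would be penultimate-layer activations of a
# trained network, `dnn_pred` the argmax of its output on the same inputs.
embs = [(0.1, 0.0), (0.0, 0.2), (0.2, 0.1), (0.9, 1.0), (1.0, 0.8), (0.8, 0.9)]
dnn_pred = [0, 0, 0, 1, 1, 1]

agreement = sum(knn_label(i, embs, dnn_pred) == dnn_pred[i]
                for i in range(len(embs))) / len(embs)
print(agreement)
```

High agreement on real networks is what connects generalization to the local structure of the learned embedding space.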
Building & Improving a K-Nearest Neighbors Algorithm in Python
The K-Nearest Neighbors algorithm, K-NN for short, is a classic machine learning workhorse algorithm that is often overlooked in the age of deep learning. In this tutorial, we will build a K-NN algorithm in Scikit-Learn and run it on the MNIST dataset. From there, we will build our own K-NN algorithm in the hope of developing a classifier with both better accuracy and classification speed than the Scikit-Learn K-NN. The K-Nearest Neighbors algorithm is a supervised machine learning algorithm that is simple to implement, and yet has the ability to make robust classifications. One of the biggest advantages of K-NN is that it is a lazy learner.
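The "lazy learner" point in the tutorial summary above is worth making concrete: fitting a K-NN classifier is just storing the data, and all the work happens at prediction time. A minimal from-scratch sketch in that spirit (toy 2-D points rather than the tutorial's Scikit-Learn/MNIST setup):

```python
from collections import Counter
from math import dist

class KNN:
    """Minimal from-scratch K-NN classifier."""
    def fit(self, X, y):
        # Lazy learner: "training" only stores the data; no model is built.
        self.X, self.y = X, y
        return self

    def predict(self, q, k=3):
        # All computation is deferred to query time: rank the stored points
        # by distance to q and take a majority vote over the k nearest.
        nearest = sorted(range(len(self.X)), key=lambda i: dist(q, self.X[i]))[:k]
        return Counter(self.y[i] for i in nearest).most_common(1)[0][0]

clf = KNN().fit([(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)],
                [0, 0, 0, 1, 1, 1])
print(clf.predict((0.5, 0.5)))
print(clf.predict((5.5, 5.5)))
```

The flip side of lazy learning, which the tutorial's speed comparison targets, is that naive prediction scans every stored point, so query cost grows with the training set.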
Determining Song Similarity via Machine Learning Techniques and Tagging Information
Cunha, Renato L. F., Caldeira, Evandro, Fujii, Luciana
The task of determining item similarity is a crucial one in a recommender system. This constitutes the base upon which the recommender system will work to determine which items are more likely to be enjoyed by a user, resulting in more user engagement. In this paper we tackle the problem of determining song similarity based solely on song metadata (such as the performer and song title) and on tags contributed by users. We evaluate our approach under a series of different machine learning algorithms. We conclude that tf-idf achieves better results than Word2Vec for mapping the dataset to feature vectors. We also conclude that k-NN models have better performance than SVMs and Linear Regression for this problem.
- Media > Music (0.94)
- Leisure & Entertainment (0.94)
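The tf-idf featurization that the abstract above favours over Word2Vec is simple enough to sketch directly: term frequency within a document, down-weighted by how many documents share the term. The song-tag corpus here is invented for illustration, not taken from the paper's dataset:

```python
from math import log

def tfidf(docs):
    """tf-idf vectors for a corpus of whitespace-separated tag strings.
    Returns one {term: weight} dict per document, where weight is
    term-frequency times log inverse document frequency."""
    tokenized = [d.lower().split() for d in docs]
    n = len(tokenized)
    vocab = {t for doc in tokenized for t in doc}
    idf = {t: log(n / sum(t in doc for doc in tokenized)) for t in vocab}
    return [{t: doc.count(t) / len(doc) * idf[t] for t in set(doc)}
            for doc in tokenized]

tags = ["rock guitar loud", "rock guitar solo", "piano classical quiet"]
vecs = tfidf(tags)
```

A tag shared by many songs ("rock") ends up with a smaller weight than a distinctive one ("loud"), which is exactly why tf-idf vectors pair well with the paper's k-NN similarity search.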